Deep learning for information extraction in the biomedical domain
Mención Internacional en el título de doctor
The main hypothesis of this PhD dissertation is that novel Deep Learning algorithms can outperform classical Machine Learning methods for the task of Information Extraction in the Biomedical Domain. In contrast to classical systems, Deep Learning models can learn the representation of the data automatically, without expert domain knowledge, avoiding the tedious and time-consuming task of defining relevant features.
A Drug-Drug Interaction (DDI), an important subset of Adverse Drug Reactions (ADRs),
is an alteration in the effects of drugs that are taken simultaneously. The early
recognition of interacting drugs is a vital process that prevents serious health problems,
which in the worst cases can cause death. Health-care professionals and researchers in this
domain find the task of discovering information about these incidents very challenging due
to the vast number of pharmacovigilance documents. For this reason, several shared tasks
and datasets have been developed to address this issue with automated annotation systems
capable of extracting this information. In the present document, the DDI corpus, an
annotated dataset of DDIs, is used with Deep Learning architectures, without any external
information, for the tasks of Named Entity Recognition and Relation Extraction in order to
validate the hypothesis. Furthermore, other datasets are tested to demonstrate the
performance of these systems.
To sum up, the results suggest that the most common Deep Learning methods, such as
Convolutional Neural Networks and Recurrent Neural Networks, outperform the traditional
algorithms, leading to the conclusion that Deep Learning is a real alternative for a specific
and complex scenario such as Information Extraction in the Biomedical domain. As a final goal,
a complete architecture covering the two tasks is developed to structure the named entities
and their relationships from raw pharmacological texts.
This thesis has been supported by:
Pre-doctoral research training scholarship of the Carlos III University of Madrid (PIF UC3M 02-1415) for four years.
Research Program of the Ministry of Economy and Competitiveness - Government of Spain, (DeepEMR project TIN2017-87548-C2-1-R).
Research Program of the Ministry of Economy and Competitiveness - Government of Spain, (eGovernAbility-Access project TIN2014-52665-C2-2-R).
Doctoral stay TEAM - Technologies for information and communication, Europe - east
Asia Mobilities project (Erasmus Mundus Action 2-Strand 2 Programme) funded by the
European Commission realized in the University of Tokyo, Japan, for the Aizawa Laboratory in National Institute of Informatics (NII) for seven months.PublicadoPrograma de Doctorado en Ciencia y Tecnología Informática por la Universidad Carlos III de MadridPresidente: Ricardo Aler Mur.- Secretario: Alberto Díaz Esteban.- Vocal: María Herrero Zaz
Clasificación de escenas en contenido audiovisual
In this project, algorithms were designed for scene classification into generic categories such as coast, street, mountain, sky, office, and house. The developed algorithms are intended to be integrated into a system for automatic annotation of visual content. To that end, a comparative and systematic study of two state-of-the-art alternatives is performed: the image-gist extraction method, known as the gist descriptor, and the bag-of-words method for feature extraction. A support vector machine is then used for classification. The motivation of the project is, first, to develop a system able to classify a large database of images according to the type of visual scene, and, second, for the student to learn and apply different methods for scene classification, evaluate them, and draw conclusions from the results.
Ingeniería Técnica en Sonido e Image
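The gist-descriptor side of the comparison depends on the original implementation, but the bag-of-words half can be sketched in a few lines. The following is a minimal, hypothetical illustration (synthetic random vectors stand in for real local image descriptors): local descriptors are clustered into a visual vocabulary with k-means, each image becomes a histogram of visual words, and a support vector machine classifies the histograms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def bow_histogram(descriptors, kmeans):
    # Assign each local descriptor to its nearest visual word and
    # build a normalised histogram of word counts for the image.
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()

# Synthetic "local descriptors": 10 images per class, 50 descriptors each.
train_desc = [rng.normal(loc=c, size=(50, 8)) for c in (0.0, 3.0) for _ in range(10)]
labels = [0] * 10 + [1] * 10  # e.g. 0 = "coast", 1 = "street" (toy labels)

# Visual vocabulary from all training descriptors, then per-image histograms.
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(np.vstack(train_desc))
X = np.array([bow_histogram(d, kmeans) for d in train_desc])

# Support vector machine over the bag-of-words histograms.
clf = SVC(kernel="rbf").fit(X, labels)
```

On real data the descriptors would come from a local feature extractor such as SIFT, and the classifier would be evaluated on a held-out split rather than the training set.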
A multi-BERT hybrid system for named entity recognition in Spanish radiology reports
This work describes the methods proposed by the EdIE-KnowLab team for the Information Extraction task of CLEF eHealth 2021, SpRadIE Task 1. The task focuses on detecting and classifying relevant mentions in ultrasonography reports. The architecture developed is an ensemble of multiple BERT (multi-BERT) systems, one per entity type, together with a generated dictionary and available off-the-shelf tools (the Google Healthcare Natural Language API and GATECloud's Measurement Expression Annotator system) applied to the documents translated into English, with word alignment from a neural machine translation tool, the Microsoft Translator API. Our best system configuration (multi-BERT with a dictionary) achieves 85.51% and 80.04% F1 for the Lenient and Exact metrics, respectively. Thus, the system ranked first out of 17 submissions from the 7 teams that participated in this shared task. Our system also achieved the best Recall by merging the previous predictions with the results obtained from English-translated texts and cross-lingual word alignment (83.87% Lenient match and 78.71% Exact match). The overall results demonstrate the potential of pre-trained language models and cross-lingual word alignment for limited-corpus, low-resource NER in the clinical domain.
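The abstract does not spell out how the per-entity-type predictions and the dictionary are combined, so the following is only a schematic, hypothetical sketch of such a merge: each single-label model contributes its character spans, and dictionary matches are added wherever they do not overlap a model prediction. All names, spans, and the lexicon below are invented for illustration.

```python
import re

def dictionary_spans(text, lexicon):
    # Exact, case-insensitive lexicon matching over the raw text.
    spans = []
    for term, label in lexicon.items():
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), label))
    return spans

def merge(model_spans, dict_spans):
    # Keep model predictions; add dictionary hits that do not overlap them.
    merged = list(model_spans)
    for s in dict_spans:
        if all(s[1] <= t[0] or s[0] >= t[1] for t in merged):
            merged.append(s)
    return sorted(merged)

text = "Higado de tamano normal; quiste simple en rinon derecho."
model_spans = [(25, 31, "Finding")]  # e.g. output of the "Finding" model
lexicon = {"higado": "Anatomical_Entity", "rinon": "Anatomical_Entity"}
spans = merge(model_spans, dictionary_spans(text, lexicon))
```

A real system would also have to resolve conflicting labels between models; here the dictionary simply fills gaps left by the per-entity-type predictions.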
A two-stage deep learning approach for extracting entities and relationships from medical texts
This work presents a two-stage deep learning system for Named Entity Recognition (NER) and Relation Extraction (RE) from medical texts. These tasks are a crucial step for many natural language understanding applications in the biomedical domain. Automatic medical coding of electronic medical records, automated summarizing of patient records, automatic cohort identification for clinical studies, text simplification of health documents for patients, early detection of adverse drug reactions or automatic identification of risk factors are only a few examples of the many possible opportunities that text analysis can offer in the clinical domain. In this work, our efforts are primarily directed towards the improvement of the pharmacovigilance process by the automatic detection of Drug-Drug Interactions (DDI) from texts. Moreover, we deal with the semantic analysis of texts containing health information for patients. Our two-stage approach is based on deep learning architectures. Concretely, NER is performed by combining a bidirectional Long Short-Term Memory (Bi-LSTM) and a Conditional Random Field (CRF), while RE applies a Convolutional Neural Network (CNN). Since our approach uses very few language resources (only pre-trained word embeddings) and does not exploit any domain resources (such as dictionaries or ontologies), it can easily be extended to support other languages and clinical applications that require the exploitation of semantic information (concepts and relationships) from texts...
This work was supported by the Research Program of the Ministry of Economy and Competitiveness - Government of Spain (DeepEMR project TIN2017-87548-C2-1-R).
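The abstract describes the two stages but not their interface. The sketch below is a hypothetical illustration of how such stages connect, with toy stand-ins for both models: in a real system, stage one would be the Bi-LSTM+CRF tagger and stage two the CNN classifier over candidate entity pairs.

```python
from itertools import combinations

def ner_stage(tokens):
    # Stand-in for the Bi-LSTM+CRF tagger: emits BIO-style tags and
    # converts them to token-index spans. The capitalisation rule is
    # a toy heuristic, not the actual model.
    tags = ["B-DRUG" if t[0].isupper() else "O" for t in tokens]
    return [(i, i + 1, "DRUG") for i, tag in enumerate(tags) if tag == "B-DRUG"]

def re_stage(tokens, entities):
    # Stand-in for the CNN relation classifier: every entity pair is a
    # candidate; here each pair gets a fixed toy label.
    return [(e1, e2, "mechanism") for e1, e2 in combinations(entities, 2)]

tokens = "coadministration of Warfarin and Aspirin increases bleeding risk".split()
entities = ner_stage(tokens)            # stage 1: spans of drug mentions
relations = re_stage(tokens, entities)  # stage 2: relations between spans
```

The key design point carried over from the abstract is that stage two consumes only the spans produced by stage one plus the token sequence, so either stage can be swapped out independently.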
Rare Disease Identification from Clinical Notes with Ontologies and Weak Supervision
The identification of rare diseases from clinical notes with Natural Language
Processing (NLP) is challenging due to the few cases available for machine
learning and the need for data annotation from clinical experts. We propose a
method using ontologies and weak supervision. The approach includes two steps:
(i) Text-to-UMLS, linking text mentions to concepts in Unified Medical Language
System (UMLS), with a named entity linking tool (e.g. SemEHR) and weak
supervision based on customised rules and Bidirectional Encoder Representations
from Transformers (BERT) based contextual representations, and (ii)
UMLS-to-ORDO, matching UMLS concepts to rare diseases in Orphanet Rare Disease
Ontology (ORDO). Using MIMIC-III US intensive care discharge summaries as a
case study, we show that the Text-to-UMLS process can be greatly improved with
weak supervision, without any annotated data from domain experts. Our analysis
shows that the overall pipeline processing discharge summaries can surface rare
disease cases, which are mostly uncaptured in manual ICD codes of the hospital
admissions.
Comment: 5 pages, 3 figures, accepted for IEEE EMBC 202
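The exact weak-supervision rules are given in the paper itself; as a purely hypothetical illustration of the idea, candidate links from a NER+L tool can be weakly labelled with simple heuristics, yielding training data without expert annotation. The mention-length and prevalence thresholds below are illustrative assumptions, not the paper's rules.

```python
# Candidate mention-to-concept links, e.g. as produced by a NER+L tool
# such as SemEHR; "prevalence" is the fraction of documents containing
# the linked concept (all values here are invented).
candidates = [
    {"mention": "CF", "concept": "Cystic Fibrosis", "prevalence": 0.02},
    {"mention": "cystic fibrosis", "concept": "Cystic Fibrosis", "prevalence": 0.002},
    {"mention": "pt", "concept": "Port", "prevalence": 0.30},
]

def weak_label(c, min_len=4, max_prev=0.01):
    # Heuristic: long, rare mentions are weakly labelled as correct
    # links; short abbreviations and very common concepts as spurious.
    return len(c["mention"]) >= min_len and c["prevalence"] <= max_prev

labels = [weak_label(c) for c in candidates]
```

These weak labels would then train a phenotype confirmation model (e.g. over BERT contextual representations of each mention), which generalises beyond what the rules themselves capture.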
Ontology-driven and weakly supervised rare disease identification from clinical notes
BACKGROUND: Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to identify due to the few cases available for machine learning and the need for data annotation from domain experts. METHODS: We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-driven framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in the Unified Medical Language System (UMLS) with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach is proposed to learn a phenotype confirmation model that improves Text-to-UMLS linking, without annotated data from domain experts. We evaluated the approach on three annotated clinical datasets: MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports, from two institutions in the US and the UK. RESULTS: The improvements in precision were pronounced (by over 30% to 50% absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with those on the discharge summaries. The overall pipeline processing clinical notes can extract rare disease cases mostly uncaptured in structured data (manually assigned ICD codes). CONCLUSION: The study provides empirical evidence for the task by applying a weakly supervised NLP pipeline on clinical notes.
The proposed weakly supervised deep learning approach requires no human annotation except for validation and testing, by leveraging ontologies, NER+L tools, and contextual representations. The study also demonstrates that Natural Language Processing (NLP) can complement traditional ICD-based approaches to better estimate rare diseases in clinical notes. We discuss the usefulness and limitations of the weak supervision approach and propose directions for future studies.
A systematic review of natural language processing applied to radiology reports
NLP has a significant role in advancing healthcare and has been found to be
key in extracting structured information from radiology reports. Understanding
recent developments in NLP application to radiology is of significance but
recent reviews on this are limited. This study systematically assesses recent
literature in NLP applied to radiology reports. Our automated literature search
yields 4,799 results using automated filtering, metadata enriching steps and
citation search combined with manual review. Our analysis is based on 21
variables including radiology characteristics, NLP methodology, performance,
study, and clinical application characteristics. We present a comprehensive
analysis of the 164 publications retrieved with each categorised into one of 6
clinical application categories. Deep learning use increases but conventional
machine learning approaches are still prevalent. Deep learning remains
challenged when data is scarce and there is little evidence of adoption into
clinical practice. Despite 17% of studies reporting greater than 0.85 F1
scores, it is hard to comparatively evaluate these approaches given that most
of them use different datasets. Only 14 studies made their data available, and
15 their code, with 10 externally validating their results. Automated understanding
of clinical narratives of the radiology reports has the potential to enhance
the healthcare process but reproducibility and explainability of models are
important if the domain is to move applications into clinical use. More could
be done to share code enabling validation of methods on different institutional
data and to reduce heterogeneity in reporting of study properties allowing
inter-study comparisons. Our results have significance for researchers
providing a systematic synthesis of existing work to build on, identify gaps,
opportunities for collaboration and avoid duplication
Evaluation of pooling operations in convolutional architectures for drug-drug interaction extraction
Background: Deep Neural Networks (DNNs), in particular Convolutional Neural Networks (CNNs), have recently achieved state-of-the-art results for the task of Drug-Drug Interaction (DDI) extraction. Most CNN architectures incorporate a pooling layer to reduce the dimensionality of the convolution layer output, preserving relevant features and removing irrelevant details. All previous CNN-based systems for DDI extraction used max-pooling layers. Results: In this paper, we evaluate the performance of various pooling methods (in particular max-pooling, average-pooling and attentive pooling), as well as their combinations, for the task of DDI extraction. Our experiments show that max-pooling achieves a higher F1-score (64.56%) than attentive pooling (59.92%) and average-pooling (58.35%). Conclusions: Max-pooling outperforms the other alternatives because it is the only one that is invariant to the special pad tokens appended to the shorter sentences, known as padding. Moreover, the combination of max-pooling and attentive pooling does not improve the performance compared with the single max-pooling technique.
Publication of this article was supported by the Research Program of the Ministry of Economy and Competitiveness - Government of Spain (DeepEMR project TIN2017-87548-C2-1-R) and the TEAM project (Erasmus Mundus Action 2-Strand 2 Programme) funded by the European Commission.
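The padding argument in the conclusion can be illustrated numerically. Assuming the convolution outputs at pad positions are masked to a very low value (an implementation assumption, not necessarily the paper's exact setup), max-pooling over the time axis is unchanged by padding, while average-pooling is diluted by the pad positions:

```python
import numpy as np

# Convolution output for 3 real tokens and 2 filters (invented values).
conv_out = np.array([[0.9, 0.1],
                     [0.4, 0.8],
                     [0.2, 0.5]])

# Two pad positions appended to reach the batch's maximum length,
# masked to a very low value so they can never win a max.
pad = np.full((2, 2), -1e9)
padded = np.vstack([conv_out, pad])

max_unpadded = conv_out.max(axis=0)   # pooling over real tokens only
max_padded = padded.max(axis=0)       # identical: padding is ignored
mean_unpadded = conv_out.mean(axis=0)
mean_padded = padded.mean(axis=0)     # dragged down by the pad rows
```

This is the invariance the paper credits for max-pooling's higher F1: the pooled feature vector depends only on the real tokens, regardless of how much padding the batch requires.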